
Conversation

@shewu-quic
Collaborator

Background

We observed that quantizing and compiling the original SHA (single-head attention) model takes a significant amount of time, while switching to the MHA (multi-head attention) model speeds up this process. We therefore investigated whether the MHA model could be converted to SHA after quantization. This conversion cannot be performed during the to_edge transformation, because splitting the convolution weights into per-head SHA weights would require modifying the state_dict, which is not permitted at that stage. We therefore apply the pass during qnn_preprocess.
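For intuition, here is a minimal sketch of the splitting idea with made-up shapes; the actual pass rewrites convolution nodes in the edge graph during qnn_preprocess, not raw tensors:

import torch

# Hypothetical shapes for illustration only.
n_heads, head_dim, embed_dim = 4, 8, 32

# Fused MHA Q-projection as a 1x1 convolution weight:
# (out_channels, in_channels, kH, kW) = (n_heads * head_dim, embed_dim, 1, 1).
wq_mha = torch.randn(n_heads * head_dim, embed_dim, 1, 1)

# SHA form: split along the output-channel dimension, one weight per head.
wq_sha = torch.split(wq_mha, head_dim, dim=0)

assert len(wq_sha) == n_heads
assert wq_sha[0].shape == (head_dim, embed_dim, 1, 1)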

Summary:

  • Integrated the MHA-to-SHA pass into qnn_preprocess
  • Refactored MHA in static llama
    • Included SpinQuant R3 support and masked softmax for the MHA model in static llama
    • Combined the n_heads key-value caches into a single cache per layer to reduce the number of inputs and outputs, which improves performance (see the sketch after this list)
  • Deprecated the ShiftPointer KV updater mode
    • Since each layer now has its own KV cache, the v cache no longer benefits from ShiftPointer, which previously avoided copying the new v cache into the input v cache. To prevent user confusion, ShiftPointer mode has been deprecated.
  • Applied the correct input template for SmolLM2 135M
  • Corrected the quantization annotation for reshape
  • Removed outdated code from CanonicalizeConv
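A minimal sketch of the combined per-layer KV cache mentioned above, with hypothetical shapes (not the actual static llama code):

import torch

n_heads, head_dim, max_seq_len = 4, 8, 128

# Before: one cache tensor per head per layer, i.e. 2 * n_heads cache I/Os per layer.
per_head_k_caches = [torch.zeros(1, head_dim, max_seq_len) for _ in range(n_heads)]

# After: a single cache tensor per layer, i.e. 2 cache I/Os per layer (K and V).
k_cache = torch.zeros(1, n_heads, head_dim, max_seq_len)

# Same data; each head is a slice along dim 1 of the combined tensor.
assert k_cache[:, 0].shape == per_head_k_caches[0].shape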

Results

Following the README settings, tested on SM8750 with QNN 2.37. Compared the new convert_mha_to_sha pass against the original SHA structure:

[image: performance comparison of convert_mha_to_sha vs. the original SHA structure]

@shewu-quic shewu-quic requested a review from cccclai as a code owner October 29, 2025 06:56
@pytorch-bot

pytorch-bot bot commented Oct 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15438

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (4 Unrelated Failures)

As of commit 601c14c with merge base ca4c575:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Oct 29, 2025
@shewu-quic
Collaborator Author

@pytorchbot label "release notes: qualcomm"

@pytorch-bot pytorch-bot bot added the release notes: qualcomm label Oct 29, 2025
@shewu-quic
Collaborator Author

Hi @cccclai,
This PR migrates the mha2sha transformation from the source level to a pass applied in qnn_preprocess. It significantly improves lowering time, including quantization and compilation time.
Could you please take a look?

Thanks

@cccclai
Contributor

cccclai commented Oct 31, 2025

Hi, since it's a really big change and the MHA2SHA pass seems complicated, can you add a test for the pass here: https://github.com/pytorch/executorch/blob/main/backends/qualcomm/tests/test_passes.py? Passes can be fragile, so I'm trying to make sure we have them covered by tests.

@shewu-quic shewu-quic force-pushed the dev1/hutton/add_mha_to_sha_pass branch from 18e7db1 to 0a666d2 Compare November 3, 2025 10:00
@shewu-quic
Collaborator Author


Thanks for pointing it out. I have added a test case to check the functionality of MHA2SHA.

conv_nodes = [
    n
    # graph_module: the transformed edge graph under test
    for n in graph_module.graph.nodes
    if n.target == exir_ops.edge.aten.convolution.default
]
# Check graph structure: WQ, WK, WV should be converted to SHA
self.assertTrue(len(conv_nodes) == 25, "Convolution nodes should be split")
Contributor

Thanks for adding the test! Is it possible to check whether the numerics are the same?

Collaborator Author

Sure. I have added it. Thanks!
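For reference, a minimal sketch of what such a numeric check can look like; MhaModel and ConvertMhaToSha are hypothetical stand-ins, not the actual names in test_passes.py:

import torch
from executorch.exir import to_edge

# Hypothetical stand-ins for the model and the pass under test.
model = MhaModel().eval()
sample_input = (torch.randn(1, 32, 1, 8),)

# Reference output from the original MHA model.
golden = model(*sample_input)

# Export, apply the MHA-to-SHA pass, and run the transformed graph
# (assumes parameters remain reachable from the graph module).
edge = to_edge(torch.export.export(model, sample_input))
result = ConvertMhaToSha(edge.exported_program())(edge.exported_program().graph_module)
actual = result.graph_module(*sample_input)

# Splitting MHA into per-head SHA convolutions should be numerically
# equivalent up to floating-point tolerance.
torch.testing.assert_close(actual, golden, atol=1e-5, rtol=1e-5)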

@shewu-quic shewu-quic force-pushed the dev1/hutton/add_mha_to_sha_pass branch from 0a666d2 to 0b33455 Compare November 4, 2025 05:13
